Making Decision Trees Feasible in Ultrahigh Feature and Label Dimensions

Authors

  • Weiwei Liu
  • Ivor W. Tsang
Abstract

Because of their non-linear yet highly interpretable representations, decision tree (DT) models have attracted considerable attention from researchers. However, DT models are difficult to understand and interpret in ultrahigh dimensions; they also usually suffer from the curse of dimensionality, and their performance degrades when there are many noisy features. To address these issues, this paper first presents a novel data-dependent generalization error bound for the perceptron decision tree (PDT), which provides the theoretical justification for learning a sparse linear hyperplane in each decision node and for pruning the tree. Following this analysis, we introduce the notion of a budget-aware classifier (BAC) with a budget constraint on the weight coefficients, and propose a supervised budgeted tree (SBT) algorithm to achieve non-linear prediction performance. To avoid generating an unstable and complicated decision tree and to improve the generalization of the SBT, we present a pruning strategy that learns classifiers to minimize cross-validation errors on each BAC. To deal with ultrahigh label dimensions, we develop a sparse coding tree framework for multi-label annotation problems, motivated by three important phenomena observed in real-world data sets from a variety of application domains, and provide a theoretical analysis. Extensive empirical studies verify that 1) SBT is easy to understand and interpret in ultrahigh dimensions and is more resilient to noisy features, and 2) compared with state-of-the-art algorithms, the proposed sparse coding tree framework is more efficient, yet accurate, in ultrahigh label and feature dimensions.
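As a rough illustration of the idea of a sparse linear hyperplane in each decision node, the sketch below routes an example through a small tree whose internal nodes test the sign of a linear score over only a few features. It is a minimal sketch under stated assumptions, not the SBT algorithm from the paper: the names `truncate_to_budget`, `Node`, `Leaf`, and `predict` are hypothetical, and keeping the largest-magnitude weights is a simple stand-in heuristic for the budget constraint, which the abstract does not specify in detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


def truncate_to_budget(weights: List[float], budget: int) -> Dict[int, float]:
    """Keep at most `budget` non-zero weights, chosen by largest magnitude.

    Assumption: this top-k heuristic only stands in for a budget constraint;
    it is not the optimisation used to train a BAC in the paper.
    """
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return {j: weights[j] for j in ranked[:budget] if weights[j] != 0.0}


@dataclass
class Leaf:
    label: int


@dataclass
class Node:
    weights: Dict[int, float]  # sparse hyperplane: feature index -> coefficient
    bias: float
    left: Union["Node", Leaf]   # taken when the score is <= 0
    right: Union["Node", Leaf]  # taken when the score is > 0

    def score(self, x: List[float]) -> float:
        # Only the features kept in `weights` are ever read.
        return sum(w * x[j] for j, w in self.weights.items()) + self.bias


def predict(tree: Union[Node, Leaf], x: List[float]) -> int:
    """Follow the sign of each node's sparse linear score down to a leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.right if node.score(x) > 0 else node.left
    return node.label


# Hypothetical dense weights over five features, reduced to a budget of two.
root_weights = truncate_to_budget([0.0, 2.0, -0.1, 0.0, -1.5], budget=2)
tree = Node(
    weights=root_weights,
    bias=-0.5,
    left=Leaf(0),
    right=Node(weights={2: 1.0}, bias=0.0, left=Leaf(1), right=Leaf(2)),
)

print(root_weights)
print(predict(tree, [9.0, 1.0, 0.0, 9.0, 0.0]))
```

Because each node reads only the features in its `weights` dictionary, the path an example takes can be explained by a handful of coefficients, which is the interpretability argument the abstract makes for sparse splits.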


Similar resources

Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection

Due to the rise of technology, the possibility of fraud in different areas such as banking has increased. Credit card fraud is a crucial problem in banking and its danger is ever increasing. This paper proposes an advanced data mining method, considering both feature selection and decision cost for accuracy enhancement of credit card fraud detection. After selecting the best and most effec...


Ant Colony Optimization to Discover the Concealed Pattern in the Recruitment Process of an Industry

Recruiting the most appropriate employees and retaining them are immense challenges for the HR departments of most industries. Every year IT companies recruit fresh graduates through their campus selection programs. Usually industries examine the skills of the candidate by conducting tests, group discussions and a number of interviews. This process requires enormous amount of effort...


Religion and Family Structure: Two Factors Affecting on Consumer Decision Making Styles in Iran

Purpose- The aim of this essay is to attempt to explain the impact of religion and family structure on consumer decision-making style within a Muslim country. This paper wants to demonstrate how and why husbands/wives with Eastern culture and Islamic norms use different decision-making styles. Design/methodology/approach- Literature reviews on consumer decision-making, religion and family struc...


Label Distribution Learning Forests

Label distribution learning (LDL) is a general learning framework, which assigns a distribution over a set of labels to an instance rather than a single label or multiple labels. Current LDL methods have either restricted assumptions on the expression form of the label distribution or limitations in representation learning. This paper presents label distribution learning forests (LDLFs) a novel...


A Study of the impact of Knowledge Management Infrastructures and Dimensions of Improvement of Managers' Decision-making on the Knowledge management status in the Iran Public Libraries Foundation

Purpose: To determine the impact of the knowledge management infrastructures and the dimensions of improvement of managers' decision-making on the status of knowledge management in the public libraries affiliated to the Iran Public Libraries Foundation. Method: This research is a descriptive-correlational study. The statistical sample of the research was selected by using the cluster random sa...



Journal:
  • Journal of Machine Learning Research

Volume 18

Publication date: 2017